feat: implement rae autoencoder. #13046

Open

Ando233 wants to merge 33 commits into huggingface:main from Ando233:rae

Conversation

@Ando233 (Author) commented Jan 28, 2026

What does this PR do?

This PR adds a new representation autoencoder implementation, AutoencoderRAE, to diffusers.

- Implements diffusers.models.autoencoders.autoencoder_rae.AutoencoderRAE with a frozen pretrained vision encoder (DINOv2 / SigLIP2 / ViT-MAE) and a ViT-MAE-style decoder.
- The decoder implementation is aligned with the RAE-main GeneralDecoder parameter structure, enabling existing trained decoder checkpoints (e.g. model.pt) to be loaded without key mismatches when encoder/decoder settings are consistent.
- Adds unit/integration tests under diffusers/tests/models/autoencoders/test_models_autoencoder_rae.py.
- Registers exports so users can import directly via from diffusers import AutoencoderRAE.

Fixes #13000

Usage

import torch

from diffusers import AutoencoderRAE

# `encoder_path`, `image_size`, `patch_size`, `num_patches`, `device`, `x`,
# and `args.decoder_ckpt` are assumed to be defined elsewhere.
ae = AutoencoderRAE(
    encoder_cls="dinov2",
    encoder_name_or_path=encoder_path,
    image_size=image_size,
    encoder_input_size=image_size,
    patch_size=patch_size,
    num_patches=num_patches,
    decoder_hidden_size=1152,
    decoder_num_hidden_layers=28,
    decoder_num_attention_heads=16,
    decoder_intermediate_size=4096,
).to(device)
ae.eval()

# Load an existing trained decoder checkpoint (e.g. model.pt from RAE-main).
state = torch.load(args.decoder_ckpt, map_location="cpu")
ae.decoder.load_state_dict(state, strict=False)

with torch.no_grad():
    recon = ae(x).sample

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@sayakpaul sayakpaul requested a review from kashif January 30, 2026 11:31
@sayakpaul (Member)

@bytetriper if you could take a look?

@kashif (Contributor) commented Jan 30, 2026

Nice work @Ando233, checking.

@kashif (Contributor) commented Jan 30, 2026

Off the bat:

  • let's have a nice convention for the output datatype classes; have a look at the other autoencoders for the convention in diffusers
  • some of the tests might need to be marked as slow, and some paths are hard-coded

Let's sort out these things and then re-look.

@bytetriper

Agree with @kashif. Also, if possible, we can bake all the params into the config so we can enable .from_pretrained(), which is more elegant and aligns with diffusers usage. I can help convert our released checkpoints to the Hugging Face format afterwards.

@sayakpaul (Member)

@Ando233 we're happy to provide assistance if needed.

@kashif (Contributor) commented Feb 15, 2026

@Ando233 the one remaining thing is the use of use_encoder_loss, and perhaps an example real-world training script.

@kashif (Contributor) commented Feb 15, 2026

@bytetriper could you kindly try to run the conversion scripts and upload the diffusers-style weights to the Hugging Face Hub for the checkpoints you have?

@Ando233 (Author) commented Feb 17, 2026

Thank you for your efforts @kashif; let me try to implement the remaining use_encoder_loss and the real-world training script.

@kashif (Contributor) commented Feb 17, 2026

@Ando233 I added that already, so next we can wait for @bytetriper for a review and see if the weight conversion works on his end

@bytetriper

Thanks for the implementation! I just checked and weight conversion works on my end. Converted models are under https://huggingface.co/collections/nyu-visionx/rae. @kashif @Ando233 Can you check whether the converted models work on your end?

@sayakpaul (Member)

@bytetriper thanks! What would be the quickest way to validate if the implementation is correct? We can do a quick value assertion test between the original model and the converted model on the same inputs. Would you be able to do it?
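Such a value assertion could be as simple as the following sketch (plain floats stand in for the two models' flattened outputs on the same input; with real tensors, torch.allclose would play the same role, and all names here are illustrative):

```python
import math

# Sketch of the proposed parity check: `original` and `converted` stand in
# for flattened outputs of the RAE-main model and the converted diffusers
# AutoencoderRAE on the same input.
def outputs_match(original, converted, rel_tol=1e-4, abs_tol=1e-5):
    """True if both sequences agree element-wise within tolerance."""
    if len(original) != len(converted):
        return False
    return all(
        math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, b in zip(original, converted)
    )

# Identical up to small numerical differences: passes.
assert outputs_match([0.12345, -1.5], [0.123451, -1.500001])
# A genuine mismatch: fails.
assert not outputs_match([0.1, 0.2], [0.1, 0.9])
```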

@sayakpaul (Member) left a comment

Left a bunch of comments. The major thing is we need to be a bit more explicit in terms of how we're defining the configs, loading encoder state dicts, etc.

I think we could aim for the following entrypoint for instantiating the AutoencoderRAE class:

AutoencoderRAE(..., encoder_type="dinov2")

Inside the implementation of AutoencoderRAE's __init__(), specifically, we can have a simple if/else block to dispatch the encoder based on encoder_type:

if encoder_type == "dinov2":
    encoder = Dinov2Encoder()
elif encoder_type == "siglip2":
    encoder = Siglip2Encoder()
...

And then, when a user does AutoencoderRAE.from_pretrained(...), the state dict should have both the encoder and decoder state dict, following how it's done in the other Autoencoder implementations of diffusers.
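As a rough sketch of that combined layout (plain dicts stand in for tensors; the encoder./decoder. key prefixes and example keys are assumptions, not the final naming):

```python
# Sketch: one state dict holding both sub-modules under prefixed keys, so a
# single from_pretrained(...) call could restore encoder and decoder at once.
def merge_state_dicts(encoder_sd, decoder_sd):
    """Combine per-submodule state dicts into a single prefixed dict."""
    merged = {f"encoder.{k}": v for k, v in encoder_sd.items()}
    merged.update({f"decoder.{k}": v for k, v in decoder_sd.items()})
    return merged

def split_state_dict(merged):
    """Recover the per-submodule state dicts from the merged layout."""
    enc = {k[len("encoder."):]: v for k, v in merged.items() if k.startswith("encoder.")}
    dec = {k[len("decoder."):]: v for k, v in merged.items() if k.startswith("decoder.")}
    return enc, dec

merged = merge_state_dicts({"embeddings.weight": 1}, {"blocks.0.attn.weight": 2})
assert "encoder.embeddings.weight" in merged
assert split_state_dict(merged) == ({"embeddings.weight": 1}, {"blocks.0.attn.weight": 2})
```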

I will also let @dg845 take a look and provide feedback.

Comment on lines 15 to 21
`AutoencoderRAE` is a representation autoencoder that combines a frozen vision encoder (DINOv2, SigLIP2, or MAE) with a ViT-MAE-style decoder.

Paper: [Diffusion Transformers with Representation Autoencoders](https://huggingface.co/papers/2510.11690).

The model follows the standard diffusers autoencoder API:
- `encode(...)` returns an `EncoderOutput` with a `latent` tensor.
- `decode(...)` returns a `DecoderOutput` with a `sample` tensor.

Cc: @stevhliu. Could you leave suggestions on the docs?


model = AutoencoderRAE(
    encoder_cls="dinov2",
    encoder_name_or_path="facebook/dinov2-with-registers-base",

The org should be nyu-visionx.

- `encode(...)` returns an `EncoderOutput` with a `latent` tensor.
- `decode(...)` returns a `DecoderOutput` with a `sample` tensor.

## Usage

@kashif does this need updating?


For latent normalization, use `latents_mean` and `latents_std` (matching other diffusers autoencoders).

See `examples/research_projects/autoencoder_rae/train_autoencoder_rae.py` for a stage-1 style training script

What does stage-2 have? Generation?


`encoder_cls` supports `"dinov2"`, `"siglip2"`, and `"mae"`.

For latent normalization, use `latents_mean` and `latents_std` (matching other diffusers autoencoders).

We should provide an example for this.
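A minimal sketch of what such an example could show (plain lists stand in for per-channel latent tensors; only the latents_mean/latents_std names come from the doc text, the rest is the usual shift-and-scale):

```python
# Sketch: normalize latents with latents_mean / latents_std before handing
# them to a diffusion model, and invert the operation before decoding.
def normalize_latents(latents, latents_mean, latents_std):
    return [(v - latents_mean) / latents_std for v in latents]

def denormalize_latents(latents, latents_mean, latents_std):
    return [v * latents_std + latents_mean for v in latents]

z = [0.5, -1.0, 2.0]
z_norm = normalize_latents(z, latents_mean=0.5, latents_std=1.5)
z_back = denormalize_latents(z_norm, latents_mean=0.5, latents_std=1.5)
# The round trip recovers the original latents.
assert all(abs(a - b) < 1e-9 for a, b in zip(z, z_back))
```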

Comment on lines +66 to +67
self.model.layernorm.weight = None
self.model.layernorm.bias = None

These params are not used in the forward pass anyway. So, maybe it's not needed?

Comment on lines 537 to 541
from transformers import AutoImageProcessor

proc = AutoImageProcessor.from_pretrained(encoder_name_or_path)
encoder_mean = torch.tensor(proc.image_mean, dtype=torch.float32).view(1, 3, 1, 1)
encoder_std = torch.tensor(proc.image_std, dtype=torch.float32).view(1, 3, 1, 1)

This should be explicitly in the conversion script. This is an antipattern for the library.

We could do something like:
https://github.com/huggingface/diffusers/blob/a80b19218b4bd4faf2d6d8c428dcf1ae6f11e43d/src/diffusers/models/autoencoders/autoencoder_kl_ltx2.py#L1112C9-L1116C1

Then in the conversion script, make these a part of the converted state dict before loading that into the diffusers implementation. LMK if it's unclear.
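For illustration, baking the stats into the converted state dict could look roughly like this (the encoder_mean/encoder_std key names and the ImageNet-style values are assumptions; a real script would store torch tensors of shape (1, 3, 1, 1) rather than nested lists):

```python
# Sketch: in the conversion script, store the image processor's normalization
# stats in the converted state dict so the modeling code no longer needs
# AutoImageProcessor at runtime. Nested lists stand in for (1, 3, 1, 1)
# tensors; keys and values are illustrative.
def add_normalization_stats(state_dict, image_mean, image_std):
    state_dict["encoder_mean"] = [[[[m]] for m in image_mean]]
    state_dict["encoder_std"] = [[[[s]] for s in image_std]]
    return state_dict

sd = add_normalization_stats({}, [0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
assert sd["encoder_mean"] == [[[[0.485]], [[0.456]], [[0.406]]]]
assert len(sd["encoder_std"][0]) == 3  # one entry per channel
```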

Comment on lines 546 to 556
# Optional latent normalization (RAE-main uses mean/var)
latents_mean_tensor = _as_optional_tensor(latents_mean)
self.do_latent_normalization = latents_mean is not None or latents_std is not None
if latents_mean_tensor is not None:
    self.register_buffer("_latents_mean", latents_mean_tensor, persistent=True)
else:
    self._latents_mean = None
if latents_std_tensor is not None:
    self.register_buffer("_latents_std", latents_std_tensor, persistent=True)
else:
    self._latents_std = None

Seems like this can be removed?

if encoder_hidden_size is None:
    raise ValueError(f"Encoder '{encoder_cls}' must define `.hidden_size` attribute.")

decoder_config = SimpleNamespace(

Why do we need this?

- trainable_cls_token
"""

def __init__(self, config, num_patches: int):

We should split out the config and expand the __init__ args here. That's how it's done in diffusers.

@sayakpaul sayakpaul requested a review from dg845 February 23, 2026 03:25
@bytetriper

> @bytetriper thanks! What would be the quickest way to validate if the implementation is correct? We can do a quick value assertion test between the original model and the converted model on the same inputs. Would you be able to do it?

I tested, and the converted model produces identical output on my end up to some small numerical differences. Just want to make sure it also has the same behavior on others' ends :)

I generally agree that we should have the encoder in the checkpoint as well. I can help with the conversion afterwards.

@sayakpaul (Member)

Cool then. I will give you a heads up when the PR is ready for another look. Thank you!

@kashif (Contributor) commented Feb 23, 2026

@bytetriper I sent you some fixes to the weights; could you kindly merge them?

@bytetriper

@kashif Merged!

@kashif kashif requested a review from sayakpaul February 26, 2026 12:22
